Abstract


This paper studies the view-manifold structure in the feature spaces implied by the different
layers of Convolutional Neural Networks (CNNs). There are several questions that this paper
aims to answer: Does the learned CNN representation achieve viewpoint invariance? How does
it achieve viewpoint invariance? Is it achieved by collapsing the view manifolds, or by
separating them while preserving them? At which layer is view invariance achieved? How can
the structure of the view manifold at each layer of a deep convolutional neural network be
quantified experimentally? How does fine-tuning of a pre-trained CNN on a multi-view dataset
affect the representation at each layer of the network? In order to answer these questions we
propose a methodology to quantify the deformation and degeneracy of view manifolds in CNN
layers. We apply this methodology and report results that answer the aforementioned questions.

1 Introduction 

Impressive results have been achieved recently with the application of Convolutional Neural
Networks (CNNs) to the tasks of object categorization (Krizhevsky et al., 2012) and detection
(Sermanet et al., 2013; Girshick et al., 2013). Several recent studies investigated different properties
of the learned representations at different layers of the network, e.g. (Yosinski et al., 2014; Zeiler
& Fergus, 2013; Chatfield et al., 2014). One fundamental question is how CNN models achieve
different invariances. It is well understood that consecutive convolution and pooling layers can achieve
translation invariance. Training CNNs with a large dataset of images, with arbitrary viewpoints
and arbitrary illumination, while optimizing the categorization loss helps to achieve viewpoint
invariance and illumination invariance.

In this paper we focus on studying the viewpoint-invariance properties of CNNs. In many applications,
it is desired to estimate the pose of the object, for example for robot manipulation and scene
understanding. Pose estimation and object categorization are tasks that contradict each other: estimating
pose requires a representation capable of capturing the viewpoint variance, while viewpoint
invariance is desired for categorization. Ultimately, the vision system should achieve a representation that
can factor out the viewpoint for categorization and preserve the viewpoint for pose estimation.

The biological vision system is able to recognize and categorize objects under wide variability in
visual stimuli, and at the same time is able to recognize object pose. It is clear that images of the same
object under different variability, in particular different views, lie on a low-dimensional manifold in
the high-dimensional visual space defined by the retinal array (~100 million photoreceptors and
~1 million retinal ganglion cells). DiCarlo & Cox (2007) hypothesized that the ability of our brain
to recognize objects, invariant to different viewing conditions such as viewpoint, and at the same
time estimate the pose, is fundamentally based on untangling the visual manifold encoded in the neural
population of the early vision areas (retinal ganglion cells, LGN, V1). They suggested that this is
achieved through a series of successive transformations (re-representations) along the ventral stream
(V1, V2, V4, to IT) that leads to an untangled population at IT. However, it is unknown how the
ventral stream achieves this untangling. They argued that since the IT population supports tasks other
than recognition, such as pose estimation, the manifold representation is somehow 'flattened' and
'untangled' in the IT layer. DiCarlo and Cox’s hypothesis is illustrated in Figure 1. They stress that
the feedforward cascade of neural re-representation is a way to untangle the visual manifold.

Inspired by recent impressive results of CNNs and by DiCarlo and Cox’s hypothesis (DiCarlo & Cox,
2007) on manifold untangling, this paper focuses on studying the view-manifold structure in the
feature spaces implied by the different layers of CNNs.

There are several questions that this paper aims to answer: 1. Does the learned CNN representation
achieve viewpoint invariance? If so, how does it achieve viewpoint invariance? Is it by collapsing
the view manifolds, or by separating them while preserving them? At which layer is view invariance
achieved? 2. How can the structure of the viewpoint manifold at each layer of a deep convolutional
neural network be quantified experimentally? 3. How does fine-tuning of a pre-trained CNN, optimized
for categorization, on a multi-view dataset affect the representation at each layer of the network?

In order to answer these questions, we present a methodology that gives insight into
the structure of the viewpoint manifold of different objects, as well as the combined object-view
manifold, in the layers of a CNN. We conducted a series of experiments to quantify the ability of
different layers of a CNN to either preserve the view-manifold structure of the data or achieve a
view-invariant representation.

The contributions of the paper are as follows: (1) We propose a methodology to quantify and get 
insight into the manifold structures in the learned representation at different layers of CNNs. (2) 
We use this methodology to analyze the viewpoint manifold of pre-trained CNNs. (3) We study the 
effect of transfer learning a pre-trained network with two different objectives (optimizing category 
loss vs. optimizing pose loss) on the representation. (4) We draw important conclusions about the 
structure of the object-viewpoint manifold and how it coincides with DiCarlo and Cox’s hypothesis. 

The paper begins by reviewing closely related works. Section 3 defines the problem, experimental
setup, and the basic CNN network that our experiments are based upon. Section 4 introduces our
methodology of analysis. Sections 5 and 6 describe the findings on the pre-trained network and the
fine-tuned networks respectively. The conclusion section summarizes our findings.



Figure 1: Illustration of the DiCarlo and Cox model (DiCarlo & Cox,
2007). Left: tangled manifolds of different objects in early vision
areas. Right: untangled (flattened) manifold representation in IT.


2 Related Work 


LeCun et al. have widely used CNNs for various vision tasks (Sermanet et al., 2013; Kavukcuoglu
et al., 2010; Jarrett et al., 2009; Ranzato et al., 2007; LeCun et al., 2004). The success of CNNs 
can be partially attributed to these efforts, in addition to training techniques that have been adopted. 
Krizhevsky et al. (2012) used a CNN in the ImageNet Challenge 2012 and achieved state-of-the-art 
accuracy. Since then, there have been many variations in CNN architectures and learning techniques 
within different application contexts. In this section we mainly emphasize related works that focused 
on bringing an understanding of the representation learned at the different layers of CNNs and 
related architectures. 

Yosinski et al. (2014) studied how CNN layers transition from general to specific. An important
finding in this study is that learning can be transferred, and that fine-tuning boosts performance
on novel data. Other transfer learning examples include (Razavian et al., 2014; Donahue
et al., 2013; Agrawal et al., 2014). Zeiler & Fergus (2013) investigated the properties of CNN
layers for the purpose of capturing object information. This study is built on the premise that there
is no coherent understanding of why CNNs work well or how we can improve them. Interesting
visualizations were used to explore the functions of layers and the intrinsics of categorization. The
study stated that CNN output layers are invariant to translation and scale but not to rotations. The
study in (Chatfield et al., 2014) evaluated and compared different deep architectures.
The effect of the output-layer dimensionality was explored.


3 Problem Definition and Experimental Setup 


It is expected that multiple views of an object lie on intrinsically low-dimensional manifolds (view
manifolds^1) in the input space. View manifolds of different instances and different objects are spread
out in this input space, and therefore jointly form what we call the object-view manifold. The input
space here denotes the space induced by an input image of size N × M, which is analogous
to the retinal array in the biological system. For the case of a viewing circle(s), the view manifold of
each object instance is expected to be a 1-dimensional closed curve in the input space. The recovery
of the category and pose of a test image reduces to finding which of the manifolds this image belongs
to, and what the intrinsic coordinate of that image is within that manifold. This view of the problem
is shared among manifold-based approaches such as (Murase & Nayar, 1995; Zhang et al., 2013;
Bakry & Elgammal, 2014).


The ability of a vision system to recover the viewpoint is directly related to how the learned
representation preserves the view manifold structure. If the transformation applied to the input space
yields a representation that collapses the view manifold, the system will no longer be able
to discriminate between different views. Since each layer of a deep NN re-represents the input in a
new feature space, the question is how the re-representations deform a manifold that already
exists in the input space. A deep NN would satisfy the hypothesis of 'flattening' and 'untangling'
by DiCarlo & Cox (2007) if the representation in a given layer separates the view manifolds of
different instances, without collapsing them, in a way that allows separating hyperplanes to be placed
between different categories. Typically CNN layers exhibit general-to-specific feature encoding, from
Gabor-like features and color blobs at low layers to category-specific features at higher layers (Zeiler
& Fergus, 2013). We can hypothesize that for the purpose of pose estimation, lower layers should
hold more useful representations that might preserve the view manifold and be better for pose
estimation. But which of these layers would be more useful, and where does the view manifold collapse
to view invariance?



There are different hypotheses we can make about how the view manifolds of different objects are
arranged in the feature space of a given layer. These hypotheses are shown in Figure 2. We arrange
these hypotheses based on the linear separability of the different objects’ view manifolds and the
preservation of the view manifolds. Case 0 is the non-degenerate case where the visual manifolds
preserve the pose information but are tangled and there is no linear separation between them (this
might resemble the input space, similar to the left case in Figure 1). Case 1 is the ideal case where
the view manifolds of different objects are preserved by the transformation and are separable
(similar to the right case in Figure 1). Case 2 is where the transformation in the network leads to
separation of the objects’ view manifolds at the expense of collapsing these manifolds to achieve
view invariance. The manifolds can collapse to different degrees, up to the point where each object’s
view manifold is mapped to a single point. Case 3 is where the transformation results in more
tangled manifolds (pose collapsing and non-separable). It is worth noting that both cases 1 and 2
are view-invariant representations. However, case 1 is clearly preferable since it also facilitates
pose recovery. It is not obvious whether optimizing a network with a categorization loss results in
case 1 or case 2. Gaining insight into which of these hypotheses holds in a given layer of a CNN is
the goal of this paper. In Section 4 we propose a methodology to obtain that insight.


Figure 2: Sketches of four hypotheses about possible structures
of the view manifolds of two objects in a given feature space.


^1 We use the terms view manifold and viewpoint manifold interchangeably.

Figure 3: KNN Tradeoffs: accuracy tradeoff between category and pose estimation using KNN. This cartoon
illustrates the global measurements; see Section 4.2 for full details.


3.1 Experimental Settings 

To gain insight into the representations of the different layers and answer the questions posed
in Section 1, we experiment on two datasets: I) the RGB-D dataset (Lai et al., 2011), II) the Pascal3D+
dataset (Xiang et al., 2014). We selected the RGB-D dataset since it is the largest available multiview
dataset with the densest viewpoint sampling. The dataset contains 300 instances of tabletop
objects (51 categories). Objects are set on a turntable and captured by an Xbox Kinect sensor
(Kinect 2010) at 3 heights (30°, 45° and 60° elevation angles). The dense view sampling along each
height is essential for our study to guarantee good sampling of the view manifold. We ignore the
depth channel and only use the RGB channels.

Pascal3D+ is very challenging because it consists of images “in the wild”, in other words, images
of object categories exhibiting high variability, captured in uncontrolled settings and under many
different poses. Pascal3D+ contains 12 categories of rigid objects selected from the PASCAL VOC
2012 dataset (Everingham et al., 2010). These objects are annotated with 3D pose information (i.e.,
azimuth, elevation and distance to camera). Pascal3D+ also adds 3D annotated images of these 12
categories from the ImageNet dataset (Deng et al., 2009). The bottle category is omitted in
state-of-the-art results. This leaves 11 categories to experiment with. There are about 11,500 and 7,000
training images in the ImageNet and Pascal3D+ subsets, respectively. For testing, there are about 11,200
and 6,900 testing images for ImageNet and Pascal3D+, respectively. On average there are about
3,000 object instances per category in Pascal3D+, making it a challenging dataset for estimating
object pose.

The two datasets provide different aspects of the analysis. While the RGB-D dataset provides dense sampling
of each instance’s view manifold, the Pascal3D+ dataset contains only very sparse sampling. Each
instance is typically imaged from a single viewpoint, with multiple instances of the same category
sampling the view manifold at arbitrary points. Therefore, in our analysis we use the RGB-D dataset
to analyze each instance’s viewpoint manifold and the combined object-viewpoint manifolds, while
Pascal3D+ provides analysis of the viewpoint manifold at the category level.

Evaluation Split: For our study, we need to make sure that the objects we are dealing with have
non-degenerate view manifolds. We observed that many of the objects in the RGB-D dataset are
ill-posed, in the sense that the poses of the object are not distinct. This happens when the objects
have no discriminating texture or shape that identifies the different poses (e.g. a texture-less
ball, apple or orange on a turntable). This causes view manifold degeneracy. Therefore we select
34 out of the 51 categories as objects that possess pose variations across the viewpoints, and thus
are not ill-posed with respect to pose estimation.

We split the data into training, validation and testing sets. Since in this dataset most categories have
few instances, we left out two random object instances per category, one for validation and one for
testing. In the case where a category has fewer than 5 instances, we form the validation set for that
category by randomly sampling from the training set. Besides the instance split, we also left out all
images at the middle height for testing. Therefore, the testing set is composed of unseen instances and unseen
heights, which allows us to more accurately evaluate the capability of the CNN architectures in
discriminating categories and estimating the pose of tabletop objects.

3.2 Base Network: Model0

The base network we use is the Convolutional Neural Network described in Krizhevsky et al. (2012),
the winner of the LSVRC-2012 ImageNet challenge (Russakovsky et al., 2014). The CNN is
composed of 8 learned layers (including a 1000-neuron output layer corresponding to 1000 classes). We call the
layers, in order: Conv1, Pool1, Conv2, Pool2, Conv3, Conv4, Conv5, Pool5, FC6, FC7, FC8, where
Pool indicates max-pooling layers, Conv indicates layers performing convolution on the previous
layer, and FC indicates fully connected layers. The last fully connected layer (FC8) is fed to a
1000-way softmax, which produces a distribution over the category labels of the dataset.


4 Methodology 

The goal of our methodology is twofold: (1) to study the transformation that happens to the viewpoint
manifold of a specific object instance at different layers, and (2) to study the structure of the combined
object-view manifold at each layer to gain insight into how tangled or untangled the different
objects’ viewpoint manifolds are. Both approaches give us insight into which of the
hypotheses explained in Section 3 is correct at each layer, at least relatively, by comparing layers.
This section introduces our methodology, which consists of two sets of measurements that address
these two points. First, we introduce instance-specific measurements that quantify
the viewpoint manifold in the different layers to help understand whether the layers preserve the
manifold structure. We performed extensive analysis on synthetic manifold data to validate the
measures; see Appendix C. Second, we introduce empirical measurements that are designed to draw
conclusions about the global object-viewpoint manifold (involving all instances).

4.1 Instance-Specific View Manifold Measurements 

Let us denote the input data (images taken from a viewing circle and their pose labels) for a specific
object instance as $\{(x_i \in \mathbb{R}^D, \theta_i \in [0, 2\pi]),\ i = 1 \cdots N\}$, where $D$ denotes
the dimensionality of the input image to the network, and $N$ is the number of images, which are equally
spaced around the viewing circle. These images form the view manifold of that object in the input space,
denoted by $\mathcal{M} = \{x_i\}_1^N$. Applying each image to the network will result in a series of nonlinear
transformations. Let us denote the transformation from the input to layer $l$ by the function
$f_l(x) : \mathbb{R}^D \to \mathbb{R}^{d_l}$, where $d_l$ is the dimensionality of the feature space of layer $l$.
With an abuse of notation we also denote the transformation that happens to the manifold $\mathcal{M}$ at
layer $l$ by $\mathcal{M}^l = f_l(\mathcal{M}) = \{f_l(x_i)\}_1^N$. After centering the data by subtracting the
mean, let $A^l = [\hat{f}_l(x_1) \cdots \hat{f}_l(x_N)]$ be the centered feature matrix at layer $l$ of dimension
$d_l \times N$, which corresponds to the centered transformed images of the given object. We call $A^l$ the
sample matrix in layer $l$.

Since the dimensionality $d_l$ of the feature space of each layer varies, we need to factor out the effect
of the dimensionality. Since $N \ll d_l$, the transformed images on all the layers lie on subspaces
of dimension $N$ in each of the feature spaces. Therefore, we can change the bases to describe the
samples using an $N$-dimensional subspace, i.e., we define the $N \times N$ matrices $\hat{A}^l = U^\top A^l$, where
$U \in \mathbb{R}^{d_l \times N}$ are the orthonormal bases spanning the column space of $A^l$ (which we can get by
SVD of $A^l = U S V^\top$). This projection rotates the samples at each layer without changing the
manifold's geometric or neighborhood properties. The following measures are then applied to
the $N$ transformed images, representing the view manifold of each object instance individually.
To obtain an overall measure for each layer, we average these measures over all the object
instances.
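
The change of basis described above can be illustrated with a minimal NumPy sketch. The function names and the array sizes (12 views, 500 features) are hypothetical and chosen only for illustration; the sketch assumes the per-view features of one layer are already available as an array with one row per view.

```python
import numpy as np

def sample_matrix(features):
    """Center an (N, d_l) array of per-view features and return it as d_l x N."""
    centered = features - features.mean(axis=0, keepdims=True)
    return centered.T

def project_to_subspace(A):
    """Express the N samples in an orthonormal basis of the column space of A."""
    U, _, _ = np.linalg.svd(A, full_matrices=False)  # U has shape (d_l, N)
    return U.T @ A                                   # shape (N, N)

rng = np.random.default_rng(0)
features = rng.normal(size=(12, 500))  # hypothetical: N = 12 views, d_l = 500
A = sample_matrix(features)
A_hat = project_to_subspace(A)

# The Gram matrix (hence all pairwise distances and angles) is unchanged.
gram_preserved = np.allclose(A_hat.T @ A_hat, A.T @ A)
```

Because the Gram matrix of the samples is the same before and after the projection, measures computed on the $N \times N$ matrix are comparable across layers of different dimensionality.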

1) Measure of spread - Nuclear Norm: There are several possible measures of the spread of the data
in the sample matrix of each view manifold. We use the nuclear norm (also known as the trace
norm (Horn)) defined as $\|A\|_* = \mathrm{Tr}(\sqrt{A^\top A}) = \sum_{i=1}^{N} \sigma_i$, i.e., it measures the sum of the
singular values of $A$.

2) Subspace dimensionality measure - Effective-p: counts the effective dimensionality of the
subspace where the view manifold lives. A smaller number means that the view manifold lives in a
lower-dimensional subspace. We define Effective-p as the minimum number of singular
values (in decreasing order) that sum up to more than or equal to p% of the nuclear norm, i.e.
$\text{Effective-}p = \sup\{n : \sum_{i=1}^{n} \sigma_i / \sum_{i=1}^{N} \sigma_i \le p/100\}$.
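
As a hedged illustration of measures 1 and 2, the sketch below computes the nuclear norm and Effective-p from the singular values of a sample matrix. It follows the prose definition (the smallest number of leading singular values whose sum reaches p% of the nuclear norm); the function names and the small numerical tolerance are assumptions made for this example.

```python
import numpy as np

def nuclear_norm(A):
    """Sum of the singular values of A."""
    return float(np.linalg.svd(A, compute_uv=False).sum())

def effective_p(A, p=90.0):
    """Smallest count of leading singular values reaching p% of the nuclear norm."""
    s = np.linalg.svd(A, compute_uv=False)  # returned in decreasing order
    total = s.sum()
    if total == 0:
        return 0  # fully collapsed manifold: no spread at all
    ratios = np.cumsum(s) / total
    # First index whose cumulative share is >= p/100 (tolerance is an assumption).
    first = int(np.searchsorted(ratios, p / 100.0 - 1e-12))
    return min(first + 1, len(s))

# Toy sample matrix with singular values 6, 3 and 1.
A = np.diag([6.0, 3.0, 1.0])
spread = nuclear_norm(A)
dim_50 = effective_p(A, 50)
dim_90 = effective_p(A, 90)
dim_95 = effective_p(A, 95)
```

In the toy matrix the first singular value carries 60% of the nuclear norm and the first two carry 90%, so the effective dimensionality grows from 1 to 3 as p is raised from 50 to 95.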

3) Alignment Measure - KTA: Ideally, the view manifold resulting from the view setting of the
studied datasets is a one-dimensional closed curve in the feature space, which can be thought of as a
deformed circle (Zhang et al., 2013). This manifold can degenerate, in the extreme case, to a
single point for a texture-less object. The goal of this measurement is to quantify how well the
transformed manifold locally preserves the original manifold structure. To this end we compare the
kernel matrix of the transformed manifold at layer $l$, denoted by $K^l_n$, with the kernel matrix of
an embedding of the ideal view manifold on the unit circle, denoted by $K^\circ_n$, where $n$ indicates the local
neighborhood size used in constructing the kernel matrix. We construct the neighborhood based on
pose labels.

Given these two kernel matrices we can define several convergence measures. We use Kernel Target
Alignment (KTA), which has been used in the literature for kernel learning (H et al., 1996). It finds
a scale-invariant dependency between two normalized kernel matrices^2. Therefore, we define the
alignment of the transformed view manifold at layer $l$ with the ideal manifold as
$KTA_n(\mathcal{M}^l) = \langle K^l_n, K^\circ_n \rangle_F / (\|K^l_n\|_F \|K^\circ_n\|_F)$.
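
The alignment formula can be sketched in a few lines of NumPy. The KTA function below follows the formula directly. The kernel construction is an assumption made for illustration only (a centered linear kernel kept for pairs within $n$ steps along the pose-ordered viewing circle), since the exact kernel is not specified in this section; all names are hypothetical.

```python
import numpy as np

def kta(K1, K2):
    """Kernel target alignment: normalized Frobenius inner product."""
    return float(np.sum(K1 * K2) / (np.linalg.norm(K1) * np.linalg.norm(K2)))

def local_linear_kernel(X, n):
    """Assumed kernel: linear kernel of pose-ordered points within n circular steps."""
    N = X.shape[0]
    Xc = X - X.mean(axis=0, keepdims=True)
    K = Xc @ Xc.T
    idx = np.arange(N)
    step = np.abs(idx[:, None] - idx[None, :])
    circular_step = np.minimum(step, N - step)
    return np.where(circular_step <= n, K, 0.0)

N, n = 16, 3
theta = 2 * np.pi * np.arange(N) / N
circle = np.column_stack([np.cos(theta), np.sin(theta)])  # ideal view manifold
K_ideal = local_linear_kernel(circle, n)

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(50, 2)))  # orthonormal columns, 50-d space
scaled_circle = 3.0 * circle @ Q.T             # rotated and scaled circle
random_cloud = rng.normal(size=(N, 50))        # no manifold structure

kta_circle = kta(local_linear_kernel(scaled_circle, n), K_ideal)
kta_random = kta(local_linear_kernel(random_cloud, n), K_ideal)
```

A circle that is merely scaled and rotated into a higher-dimensional space keeps an alignment of 1, which illustrates the scale invariance mentioned above, while an unstructured point cloud aligns much less well with the ideal kernel.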

4) KPLS-regression measures: Kernel Partial Least Squares (KPLS) (Rosipal & Trejo, 2002) is a
supervised regression method. KPLS iteratively extracts the set of principal components of the input
kernel that are most correlated with the output. We use KPLS to learn a mapping $K^l \to K^\circ$ from
the transformed view manifold kernel (input kernel) to the unit circle kernel (output kernel). We
restrict this mapping to use a maximum of $d \ll N$ principal components (we used $d = 5$). We then
define the KPLS-Regression Error, which uses the Normalized Cross Correlation to quantify the
correctness of the mapping.

5) TPS-linearity measure: In this measure we learn a regularized Thin Plate Spline (TPS)
nonlinear mapping (Duchon, 1977) between the unit circle manifold and each $\mathcal{M}^l$. The reason for
using TPS in particular is that the mapping has two parts: an affine (linear polynomial) part and a nonlinear
part. Analysis of the two parts tells us whether the mapping is mostly linear or nonlinear. We use the
reciprocal condition number (rcond) of the sub-coefficient matrices corresponding to the affine and
the nonlinear parts as a measure of the linearity of the transformation.^3

4.2 Global Object-Viewpoint Manifold Measures

To achieve an insight about the global arrangement of the different objects’ view-manifolds in a 
given feature (layer) space, we use the following three empirical measurements: 

6) Local Neighborhood Analysis: To evaluate the local manifold structure we also evaluate the
performance of nearest neighbor classifiers for both category and pose estimation, with varying
neighborhood size. This directly tells us whether the neighbors of a given point are from
the same category and/or of similar poses. KNN for categorization cannot tell us about the linear
separability of classes. However, evaluating pose estimation in the neighborhood of a datapoint gives
us an insight about how the view manifolds are preserved, and even whether the view manifolds of
different instances are aligned. To achieve this insight we use two different measurements:
KNN-Accuracy: the accuracy of KNN classifiers for category and pose estimation. KNN-Gap: the drop
in performance of each KNN classifier as the neighborhood size increases. In our experiments
we increase K from 1 to 9. A positive gap indicates a drop (expected) and a negative gap indicates
an improvement in performance.

^2 We also experimented with HSIC (Gretton et al., 2005b); however, HSIC is not scale invariant and is not
designed to compare data in different feature spaces. Therefore, HSIC did not give any discriminative signal.
^3 More details and definitions about the KPLS and TPS based measurements are in Appendix D.

Published as a conference paper at ICLR 2016

The interaction between these two measures and how they tell us about the manifold structure is
illustrated in Figure 3. The contrast between the accuracy of the KNN classifiers for pose and category
directly implies which of the hypotheses in Figure 2 is likely. The analysis of the KNN-Gap (assuming
good 1-NN accuracy) gives further valuable information. If the KNN-Gap approaches zero for both the
category and pose KNN classifiers, this implies that the neighbors are from the same category and
have the same pose, which indicates that the representation aligns the view manifolds of different
instances of the same category. If the view manifolds of such instances are preserved and separated
in the space, and the neighbors of a given point are from the same instance, this would imply a small
gap in the category KNN classifier and a bigger gap in the pose KNN classifier. A low gap in pose KNN and a
high gap in category KNN implies the representation aligns view manifolds of instances of different
categories. A high gap in both implies the representation is tangling the manifolds such
that a small neighborhood contains points from different categories and different poses. Notice that
these implications are only valid when the 1-NN accuracy is high.
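
The two KNN measurements can be sketched as follows. This is a minimal NumPy illustration on a hypothetical one-dimensional toy feature space, not the experimental code: it assumes Euclidean distance, majority voting, and pose labels discretized into integer bins.

```python
import numpy as np

def knn_accuracy(train_X, train_y, test_X, test_y, k):
    """Majority-vote KNN accuracy with Euclidean distance; labels are ints >= 0."""
    d = ((test_X[:, None, :] - train_X[None, :, :]) ** 2).sum(axis=-1)
    neighbors = np.argsort(d, axis=1, kind="stable")[:, :k]
    pred = np.array([np.bincount(train_y[row]).argmax() for row in neighbors])
    return float((pred == test_y).mean())

def knn_gap(train_X, train_y, test_X, test_y, k_small=1, k_large=9):
    """Drop in accuracy when the neighborhood grows from k_small to k_large."""
    return (knn_accuracy(train_X, train_y, test_X, test_y, k_small)
            - knn_accuracy(train_X, train_y, test_X, test_y, k_large))

# Toy data: two well-separated objects, each with 10 distinct pose bins.
positions = np.concatenate([np.arange(10), 100 + np.arange(10)]).astype(float)
train_X = positions[:, None]
category = np.repeat([0, 1], 10)
pose_bin = np.tile(np.arange(10), 2)
test_X = train_X + 0.1  # slightly perturbed views of the same objects

category_gap = knn_gap(train_X, category, test_X, category)
pose_gap = knn_gap(train_X, pose_bin, test_X, pose_bin)
pose_acc_1nn = knn_accuracy(train_X, pose_bin, test_X, pose_bin, k=1)
```

In this toy setting the category gap is zero while the pose gap is positive with perfect 1-NN pose accuracy, which is the pattern the text associates with view manifolds that are separated and preserved.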

7) L-SVM: For a test image $x$ transformed to the $l$-th layer’s feature space, $f_l(x)$, we compute
the performance of a linear SVM classifier trained for categorization. Better performance of such
a classifier directly implies more linear separability between the view manifolds of different
categories.

8) Kernel Pose Regression: To evaluate whether the pose information is preserved in a local
neighborhood of a point in a given feature space, we evaluate the performance of kernel ridge regression
for the task of pose estimation. Better performance implies a better pose-preserving transformation,
while poor performance indicates a pose-collapsing transformation. The combination of L-SVM and
kernel regression should indicate which of the hypotheses in Figure 2 is likely to be true.
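
A minimal sketch of kernel ridge regression for pose, under assumptions not stated in this section: an RBF kernel, the pose angle encoded as a point (cos, sin) on the unit circle, and hypothetical values for the kernel width and regularization. It contrasts a preserved view manifold with a fully collapsed one.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian kernel between the rows of A and the rows of B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def fit_kernel_ridge(X, theta, gamma=1.0, lam=1e-3):
    """Solve (K + lam I) alpha = Y with Y = (cos theta, sin theta)."""
    Y = np.column_stack([np.cos(theta), np.sin(theta)])
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), Y)

def predict_pose(X_train, alpha, X_test, gamma=1.0):
    """Predict angles in [0, 2*pi) from the regressed (cos, sin) pair."""
    Y = rbf_kernel(X_test, X_train, gamma) @ alpha
    return np.mod(np.arctan2(Y[:, 1], Y[:, 0]), 2 * np.pi)

def angular_error(a, b):
    """Absolute angular difference, wrapped to [0, pi]."""
    d = np.abs(a - b) % (2 * np.pi)
    return np.minimum(d, 2 * np.pi - d)

theta_train = 2 * np.pi * np.arange(36) / 36
theta_test = theta_train + 0.05

def circle_features(theta):
    return np.column_stack([np.cos(theta), np.sin(theta)])

# Pose-preserving features: the view manifold is still a circle.
alpha = fit_kernel_ridge(circle_features(theta_train), theta_train)
pred = predict_pose(circle_features(theta_train), alpha, circle_features(theta_test))
preserved_error = float(angular_error(pred, theta_test).mean())

# Pose-collapsing features: every view maps to the same point.
collapsed = np.zeros((36, 2))
alpha_c = fit_kernel_ridge(collapsed, theta_train)
pred_c = predict_pose(collapsed, alpha_c, collapsed)
collapsed_error = float(angular_error(pred_c, theta_test).mean())
```

When the features keep the circular structure the regression recovers held-out poses almost exactly, whereas on the collapsed features every test view receives the same prediction and the mean angular error is close to chance.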

5 Analysis of the Pre-trained Network 
5.1 Instance View Manifold Analysis 

Figure 4 shows the application of the instance-specific view manifold measurements on the images
of the RGB-D dataset when applied to the pre-trained network (Model0, no fine-tuning). This gives
us an insight into the transformation that happens to the view manifold of each object instance at each
layer of the network. Figure 4a shows that the nuclear norm of the transformed view manifolds in
Model0 is almost monotonically decreasing as we go higher in the network, which indicates that the
view manifolds are more spread in the lower layers. In fact, at the output layer of Model0 the nuclear
norm becomes very small, which indicates that the view manifold is collapsing to reach a view-invariant
representation at this layer. Figure 4b (p = 90%) shows that the subspace dimension varies within a
small range in the lower layers and reduces dramatically in the fully connected layers, which indicates
that the network tries to achieve view invariance. The minimum is achieved at FC8 (even without
fine-tuning). Figure 4c shows the KTA applied to Model0, where we can notice that the alignment is
almost the same across the lower layers, with Pool5 having the maximum alignment, and then starts
to drop at the very high layers, which indicates that after Pool5 the FC layers try to achieve view
invariance. Figure 4d shows that the KPLS regression error on Model0 dramatically reduces from FC8 down
to Pool5, where Pool5 has the least error. In general the lower layers have less error. This indicates
that the lower layers preserve higher correlation with the ideal manifold structure. Figure 4e shows that
the mapping is highly linear, which is expected because of the high dimensionality of the feature
spaces. From Figure 4e we can clearly notice that the lower layers have better-conditioned linear
mappings (plots for the nonlinear part are in Appendix D).

From these measurements we can conclude: (1) The lower layers preserve the view manifolds. The
manifolds start to collapse in the FC layers to achieve view invariance. Preserving the view manifold
at the lower layers is intuitive because of the nature of the convolutional layers. (2) The manifold
at Pool5 achieves the best alignment with the pose labels. This is a less intuitive result: why does
the representation after successive convolutions and pooling improve the view manifold alignment,
even without seeing any dense view manifold in training, and even without any pose labels being
involved in the loss? Our hypothesis for why Pool5 has better alignment than the
lower layers is that Pool5 has better translation-invariance properties, which results in improved
view manifold alignment.

Figure 4: RGB-D: Local measurement analysis for the view manifold. Every figure shows a single
measurement for three models (Model0, Model1 Cat and Model1 Pose) at different layers.

5.2 Global Object-View Manifold Analysis 

To study the view manifolds across the network layers, Figure 5 shows the KNN accuracy for pose and
category within the training split; no test data is used in this experiment. The category gap shrinks
as we go up in the network up to FC7 (almost 0 gap at FC6 and FC7). In contrast, the gap is
large at all layers for pose estimation. This indicates separation of the instances’ view manifolds
where the individual manifolds are not collapsed (this is why, as we increase the neighborhood, the
category performance stays the same while pose estimation decreases smoothly; see Figure 3, right,
for illustration). The results above consistently imply that the higher layers of the CNN (except FC8,
which is task specific), even without any fine-tuning on the dataset and even without any pose label
optimization, achieve representations that separate and largely preserve the view manifold structure.

The aforementioned conclusion is also confirmed by the test performance of the linear SVM and kernel
regression in Figure 6, using the RGB-D dataset. In this experiment, the models are learned on the
train split and the plots are generated using the test split. Figure 6 clearly shows the conflict in the
representation of the pre-trained network (categorization increases and pose estimation decreases).
Linear separability of category is almost monotonically increasing up to FC6. Linear separability in
FC7 and FC8 is worse, which is expected as they are task specific (no fine-tuning). Surprisingly, Pool1
features perform very poorly, despite being the most general features (typically they show Gabor-like
features and color blobs). In contrast, for pose estimation, the performance increases as we go lower
in the network up to Conv4 and then slightly decreases. This confirms our hypothesis that lower layers
offer better feature encoding for pose estimation. It seems that Pool5 provides the feature encoding
that offers the best compromise in performance, which indicates that it gives the best trade-off between
the linear separation of categories and the preservation of the view-manifold structure.



Figure 5: RGB-D: KNN for categorization and pose estimation over the layers of the pre-trained 
model (Model0), for K ∈ {1, 3, 5, 7, 9}. 
Surprisingly, the pose estimation results do not drop dramatically at FC6 and FC7. We can still 
estimate the pose (with accuracy around 63%) from the representation at these layers, even without 
any training on pose labels. This strongly suggests that the network preserves the view manifold 
structure to some degree. For example, taking the accuracy as a probability at layer FC6, we can 
loosely conclude that 90% of the manifolds are linearly separable and 65% are pose preserved (we 
are somewhere between hypotheses 1 and 2 at this layer). 


Table 1 shows the quantitative results of our models on the Pascal3D+ dataset. It also shows a 
comparison against two previous methods, (Zhang et al., 2015) and (Xiang et al., 2014), using the 
two metrics < 45° and < 22.5°. It is important to note that the comparison with (Xiang et al., 2014) 
is unfair because they solve for detection and pose simultaneously, while we solve for categorization 
and pose estimation. Model1 here outperforms both baselines (despite the unfair comparison with the 
latter approach). Quantitative results on the RGBD dataset are presented in Appendix B. 



Figure 6: RGB-D: test performance of linear SVM category classification 
over the layers of different models (Left), and pose regression (Right). 


| Approach | Categorization % | Pose (AAAI metric %) | Pose (other metrics %) |
|---|---|---|---|
| | FC6/FC7/FC8 | FC6/FC7/FC8 | |
| Model0 (SVM/Kernel Regression) | 73.64/76.38/71.13 | 49.72/48.24/45.41 | - |
| Model1 (SVM/Kernel Regression) | 74.65/79.25/84.12 | 54.41/54.07/60.31 | - |
| Model0 NN | 60.05/69.89/61.26 | 61.11/61.38/60.32 | - |
| Model1 NN | 73.50/77.30/83.07 | 65.87/66.07/70.54 | - |
| Model1 (final prediction) | 84.00 | 71.60 | 47.34 (< 22.5°), 61.30 (< 45°) |
| (Zhang et al., 2015) | - | - | 44.20 (< 22.5°), 59.00 (< 45°) |
| (Xiang et al., 2014) | - | - | 15.6 (< 22.5°), 18.7 (< 45°) |

Table 1: Pascal3D performance computed for Model0 and Model1 using different classification techniques. 
The comparison indicates that Model1 outperforms the baselines. 


6 Effect of Transfer Learning 

In order to study the effect of fine-tuning the network (transfer learning to a new dataset) on the 
representation, we trained the following model (denoted as Model1). This architecture consists of 
two parallel CNNs: one with category output nodes (Model1-Cat), and one with binned pose output 
nodes (Model1-Pose). We used 34 and 11 category nodes for the RGBD and Pascal3D datasets, 
respectively, and 16 pose nodes for both datasets. The parameters of both CNNs were 
initialized by the Model0 parameters up to FC7. The parameters connecting FC7 to the output nodes are 
randomly initialized in both networks, and the networks are fine-tuned by minimizing the categorization 
loss for Model1-Cat and the pose loss for Model1-Pose. The purpose of these architectures is to study 
the effect of fine-tuning when the category and pose are independently optimized. 
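
As an illustration of what "binned pose output nodes" can mean in practice, the following is a minimal sketch that maps an azimuth angle to one of 16 pose bins and back. The equal-width bins starting at 0° and the function names `pose_to_bin` and `bin_to_pose` are assumptions made for this example; the text only states that 16 pose nodes are used.

```python
NUM_POSE_BINS = 16  # number of pose output nodes stated in the text


def pose_to_bin(angle_deg: float, num_bins: int = NUM_POSE_BINS) -> int:
    """Map an azimuth angle in degrees to a bin index in [0, num_bins).

    Assumption: bins are equal-width and the first bin starts at 0 degrees.
    """
    width = 360.0 / num_bins
    wrapped = angle_deg % 360.0  # wrap negative angles and 360 into [0, 360)
    return int(wrapped // width) % num_bins


def bin_to_pose(bin_index: int, num_bins: int = NUM_POSE_BINS) -> float:
    """Return the center angle (in degrees) of a pose bin."""
    width = 360.0 / num_bins
    return (bin_index + 0.5) * width
```

With 16 bins each bin spans 22.5°, which matches the finer of the two pose thresholds used in the evaluation.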

We applied all the measures described in Sec 4 to understand how the view manifolds are affected 
by such tuning. The questions are: to what degree does optimizing on category damage the 
ability of the network to encode view manifolds? Conversely, to what degree does optimizing on pose 
enhance that ability? Model1-Cat indicates the effect of optimizing on category, while Model1-Pose 
indicates the effect of optimizing on pose. 

Fig 4 shows the five view manifold measures for the different layers of Model1 (Cat/Pose), in 
comparison with Model0. In terms of data spread, Fig 4a shows that the spread at FC8 has doubled 
after fine-tuning on pose (Model1-Pose). Fig 4b shows that fine-tuning on category (Model1-Cat) 
caused the view manifold subspace dimensionality to reduce significantly to 1, where it became 
totally view invariant. Optimizing on pose slightly enlarged the subspace dimensionality (i.e., it 
became better) at FC8 and FC7. Fig 4c clearly shows the significant improvement achieved by fine-tuning 
on pose, where the alignment of FC8 jumped to close to 0.9 from about 0.78, while fine-tuning on 
category reduces the alignment of FC8 to close to 0.65. Similar behavior is also apparent in the 
KPLS ratio for FC8 and FC7 (sup-mat). 

Footnote: Pose accuracy metrics are defined in (Zhang et al., 2015; Xiang et al., 2014) and stated in Appendix A. 

One very surprising result is that optimizing on pose makes the pose KTA alignment worse at the 
lower layers, while optimizing on category makes the pose alignment better compared to Model0. 
In fact, although optimizing on pose significantly helps aligning FC8 with pose labels, Pool5 still 
achieves the best KTA alignment and the least regression reconstruction error. The regression 
reconstruction error in Fig 4d clearly shows significant improvement in the ability of the FC8 and 
FC7 representations to preserve the view manifold. One surprising finding from these plots is that the 
representation of FC6 becomes worse after fine-tuning for both pose and category. Fig 4e indicates that the 
deformation of the view manifold is reduced as a result of fine-tuning on pose (larger rcond number), 
while it increases as a result of fine-tuning on category. 

On the global object-view manifold structure, we notice from Figure 6 some intuitive behavior at 
FC8. Basically, optimizing on pose reduces the linear separability and increases the view manifold 
preservation (moves the representation towards hypothesis 0). In contrast, optimizing on category 
significantly improves the linear separability at FC8; interestingly, however, it only slightly reduces 
the pose estimation performance, to slightly less than 50%. Combined with the 
observation from Fig 4b that the view manifold subspace dimensionality reduces to 1, this implies 
that optimizing on category collapses the view manifolds to a line, but they are not totally 
degenerate. What is less obvious is the effect of fine-tuning on the layers below FC8. Surprisingly, 
optimizing on pose did not affect the linear separability of FC7. Another very interesting observation 
is that optimizing on category actually improves the pose estimation slightly at FC7, FC6, and 
Pool5, and did not reduce it at lower layers. This implies that fine-tuning by optimizing on category 
alone improved the internal view manifold preservation in the network, even without any pose labels. 

7 Conclusions 

In this paper we present an in-depth analysis and discussion of the view-invariant properties of 
CNNs. We proposed a methodology to analyze individual instances' view manifolds, as well as the 
global object-view manifold. We applied the methodology to a pre-trained CNN, as well as two 
fine-tuned CNNs, one optimized for category and one for pose. We performed the analysis 
on two multi-view datasets (RGBD and Pascal3D+). Applications on both datasets give consistent 
conclusions. 

Based on the proposed methodology and the datasets, we analyzed the layers of the pre-trained 
and fine-tuned CNNs. There are several findings from our analysis that are detailed throughout 
the paper; some of them are intuitive and some are surprising. We find that a pre-trained network 
captures representations that highly preserve the manifold structure at most of the network layers, 
including the fully connected layers, except the final layer. Although the model is pre-trained on 
ImageNet, not a densely sampled multi-view dataset, the layers still have the capacity to encode 
view manifold structure. It is clear from the analysis that, except for the last layer, the representation 
tries to achieve view invariance by separating individual instances' view manifolds while preserving 
them, instead of collapsing the view manifolds to degenerate representations. This is violated at the 
last layer, which enforces view invariance. 

Overall, our analysis using linear SVM, kernel regression, and KNN, combined with the manifold 
analysis, makes us believe that the CNN is a model that simulates the manifold flattening hypothesis of 
DiCarlo & Cox (2007), even without training on a multi-view dataset and without involving pose labels 
in the objective's loss. 

Another interesting finding is that Pool5 offers a feature space where the manifold structure is still 
preserved to the best degree. Pool5 shows a better representation of the view-manifold than early 
layers like Pool1. We hypothesize that this is because Pool5 has better translation and rotation 
invariance properties, which enhance the representation of the view manifold encoding. 

We also showed the effect of fine-tuning the network on multi-view datasets, which can achieve very 
good pose estimation performance. In this paper we only studied the effect of independent pose and 
category loss optimization. Optimizing on category achieves view invariance at the very last fully 
connected layers; interestingly, it enhances the viewpoint preservation at earlier layers. We also find 
that fine-tuning mainly affects the higher layers and rarely affects the lower layers. 

In this work our goal is not to propose any new architecture or algorithm to compete with the state of 
the art in pose estimation. However, the proposed methodology can be used to guide deep network 
design for solving several tasks. To show that and based on the analysis and the conclusions of this 
paper, we introduced and studied in (Elhoseiny et al., 2015) several variants of CNN architectures 
for joint learning of pose and category, which outperform the state of the art. We keep these results 
as a guide to the reviewers, without distracting the reader from our main goal. 

Acknowledgment: This work is funded by NSF-USA award # 1409683. 
| Approach | Categorization % | Pose % |
|---|---|---|
| HOG (SVM/Kernel Regression) | 80.26 | 27.95 (AAAI) |
| Model0 (SVM/Kernel Regression) on Conv4 | 58.64 | 67.39 (AAAI) |
| Model0 (SVM/Kernel Regression) on FC6 | 86.71 | 64.39 (AAAI) |
| Model1 | 89.63 | 81.21 (AAAI), 69.58 (< 22.5°), 81.09 (< 45°) |

Table 2: RGBD Dataset Results for HOG, Model0 and Model1. 

Appendix 

A Pose and Categorization Performance on RGBD and Pascal3D Datasets 


The two metrics < 22.5° and < 45° are the percentages of test samples that satisfy AE < 22.5° 
and AE < 45°, respectively, where the absolute error (AE) is AE = |EstimatedAngle - GroundTruth|. 
The AAAI pose metric is defined as 

$$\Delta(\theta_i, \theta_j) = \min\left(|\theta_i - \theta_j|,\ 2\pi - |\theta_i - \theta_j|\right) / \pi \qquad (1)$$
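
The metrics above can be sketched in a few lines of Python. This is only an illustration of the stated formulas, not the evaluation code behind the reported numbers; the function names are hypothetical, the absolute error is taken exactly as written (no wrap-around), and the angles passed to `aaai_delta` are assumed to be in radians within [0, 2π).

```python
import math


def absolute_error_deg(estimated: float, ground_truth: float) -> float:
    """AE = |EstimatedAngle - GroundTruth| in degrees, as written in the text."""
    return abs(estimated - ground_truth)


def aaai_delta(theta_i: float, theta_j: float) -> float:
    """Normalized circular distance of Eq. (1).

    Assumption: both angles are in radians and lie in [0, 2*pi).
    The result lies in [0, 1].
    """
    diff = abs(theta_i - theta_j)
    return min(diff, 2.0 * math.pi - diff) / math.pi


def accuracy_below(estimates, truths, threshold_deg: float) -> float:
    """Percentage of samples whose absolute error is below threshold_deg."""
    hits = sum(
        absolute_error_deg(e, t) < threshold_deg for e, t in zip(estimates, truths)
    )
    return 100.0 * hits / len(estimates)
```

For example, `accuracy_below(est, gt, 22.5)` and `accuracy_below(est, gt, 45.0)` give the two threshold metrics for lists of angles in degrees.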


B Quantitative results for RGBD Dataset 

Table 2 shows the quantitative results of SVM/Kernel Regression on the features of different layers 
in Model0 and Model1, compared with the results of image HOG features. Comparison to the state of 
the art using this model is not possible, since we work on the well-posed objects split, which is not 
explored by any other work. 

C Synthetic Data Analysis 

In this work, we explored many different measurements and filtered them down to only those 
that expose the correct properties of the view-manifolds. Besides the intuitive reasoning that we 
provided for choosing the measurements, in this section we show empirical results to quantify 
the efficiency of the chosen measurements. 

To this end, we synthesized a set of well-designed view-manifolds. Analyzing these manifolds is 
intended to identify the robust and informative set of measurements to be used in further analysis. 
To qualify for comparing different manifolds, the synthesized manifolds are designed to encode 
interesting properties of any view-manifold, such as: 

• Dimensionality (of the Euclidean space where the manifold lives) 

• Sparsity of the manifold 

• Smoothness of the manifold 

• Deformation of the manifold w.r.t the view-circle 

• Variance of data-points 

Recall that the view-circle is a view-manifold where all the viewpoints form a perfect circle and the 
object is assumed to be located at the center of this circle. In the rest of this section, we give a detailed 
description of the synthetic manifolds. Then, we use them to analyze the selected measurements. 

C.1 Dataset Description 

As shown in Figure 7, the manifolds in this dataset can be categorized as: 

• Circle orthogonally projected to high-dimensional subspaces (manifold sets 1 and 2) 

• Unit circle projected to a nonlinear surface (manifold 3) 

• Unit circle projected to a 3D sphere with radius r ($S_r^2$) (sets 4 and 5) 

• Nonlinear smooth curve projected on $S_r^2$ (set 6) 

• Discontinuous smooth curve projected on $S^2$ (set 7) 

• Random manifolds (sets 8 and 9) 

• Collapsed manifolds in a single point or very small region (set 10). 

Figure 7: Manifold Visualization. (a) Sinusoidal surface; (b) circle projected on $S^2$; (c) circle projected on $S^2$. 

The manifolds are described using the dimensionality (d), the sparsity (s = n/d), the number of 
points n representing the view-manifold, the smoothness, and the deformation w.r.t. the view-circle. 

Let the view-manifold be parameterized by a single one-dimensional variable t. Let S be the 
two-dimensional representation of the unit circle, 
$S = \{(\cos t, \sin t) \mid t \in \{0, \tfrac{2\pi}{n}, \tfrac{4\pi}{n}, \ldots, \tfrac{2\pi(n-1)}{n}\}\}$. 
For each view-manifold ($\mathcal{M}$), we generated n points in a d-dimensional space. 

• Perfect view-circle in high-dimensional space 

- Manifold 1: n = 100, d ∈ {10, 300, 600, 900, 1200, 1500, 1800}; therefore, the 
sparsity varies from very dense (s = 10) to very sparse (s = 1/20) 

- Manifold 2: d = 500, n ∈ {50, 150, 250, 350, 450, 550, 650, 750}; therefore, the 
manifold varies from very sparse (s = 1/10) to dense (s = 1.5) 

• View-circle projected nonlinearly to a sinusoidal surface. The manifold has n = 100 
points and lives in a d = 3 dimensional space, so it is very dense (s = 33.33). To project the 
view-circle on this surface we follow these steps: 

- Let $f_n$ be the projection function on the surface, 

$f_n(x, y) = \sin(3x)\cos(2y)$ 

- The projected manifold Z is defined by 

$Z = \{(x, y, f_n(x, y)) \mid (x, y) \in S\}$ 

- Manifold 3 represents this type of manifold in our dataset. 

• Dense view-circle projected nonlinearly to $S_r^2$ with r ∈ {1, 50, 100, 150}, d = 3, n = 100 
(s = 33.33) 

To project the view-circle on $S^2$, we use the following projection function 

$f(\theta, \phi) = (\sin(\phi)\cos(\theta),\ \sin(\phi)\sin(\theta),\ \cos(\phi))$ 

where 

- Manifold 4: Slightly deformed manifold (Figure 7b). 

$\phi = [\ldots]\,\sin(\theta) + [\ldots], \quad \forall\theta$ 

- Manifold 5: Slightly deformed manifold with added Gaussian noise with mean $\mu = 0$ and 
$\sigma = 0.01$; $\theta$ and $\phi$ as in Manifold 4. 

- Manifold 6: Highly deformed manifold (Figure 7c). 

$\phi = [\ldots]\,\sin(5\theta) + [\ldots], \quad \forall\theta$ 

- Manifold 7: Highly deformed and broken/discontinuous manifold. 

$\phi = [\ldots]\,\tan(0.75\theta) + [\ldots], \quad \forall\theta$ 

• Random manifold with independent dimensions 

- Manifold 8: Uniform random points with d ∈ {10, 100, 500, 1000, 4000}, n = 100; 
therefore, the sparsity varies from very dense (s = 10) to very sparse (s = 1/100) 

- Manifold 9: Normal random points generated with d = 100, n ∈ {20, 40, ..., 200}; 
therefore, the sparsity varies from very sparse (s = 1/10) to dense (s = 2) 

• Collapsed manifold with random noise 

- Manifold 10: A portion of the points (m = n/4) in this manifold is generated 
from a Gaussian distribution with $\mu = 0$, $\sigma = 0.01$; the rest are copies of 
this portion. d ∈ {10, 100, 500, 1000, 4000}, n = 100 
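
To make the construction of some of these synthetic manifolds concrete, here is a minimal NumPy sketch covering the perfect view-circle embedded in a high-dimensional space (Manifolds 1 and 2) and the collapsed manifold (Manifold 10). It rests on assumptions not stated in the text: the orthogonal projection is realized with a random matrix with orthonormal columns obtained by QR factorization, the circle is sampled uniformly, and the copied points of the collapsed manifold are drawn at random with replacement. All function names are hypothetical.

```python
import numpy as np


def view_circle(n: int) -> np.ndarray:
    """n uniformly spaced points on the unit circle, shape (n, 2)."""
    t = 2.0 * np.pi * np.arange(n) / n
    return np.stack([np.cos(t), np.sin(t)], axis=1)


def embed_orthogonally(points: np.ndarray, d: int, rng: np.random.Generator) -> np.ndarray:
    """Embed 2-D points into R^d using a random basis with orthonormal columns.

    Assumption: any orthonormal 2-frame in R^d is acceptable as the embedding.
    """
    basis, _ = np.linalg.qr(rng.standard_normal((d, 2)))  # basis has shape (d, 2)
    return points @ basis.T  # shape (n, d)


def collapsed_manifold(n: int, d: int, rng: np.random.Generator,
                       sigma: float = 0.01) -> np.ndarray:
    """Draw m = n // 4 Gaussian points; the remaining points are copies of them."""
    m = n // 4
    seed_points = sigma * rng.standard_normal((m, d))
    # Assumption: copies are picked uniformly at random with replacement.
    copies = seed_points[rng.integers(0, m, size=n - m)]
    return np.vstack([seed_points, copies])


def sparsity(n: int, d: int) -> float:
    """Sparsity s = n / d as used in the dataset description."""
    return n / d
```

Because the embedding has orthonormal columns, it preserves all pairwise distances of the view-circle, so the embedded manifold is still a perfect circle regardless of d.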


C.2 Analysis 

Recall that the objective of using the synthetic data is to verify the efficiency of the selected measurements. 
Figure 8 shows the results of applying the measurements to the synthetic data. Figure 8a shows 
the Nuclear Norm (defined in the main paper) for all manifolds. This figure shows the variability 
in variance between the manifolds. For the set of manifolds 4-7, projecting the view-circle onto 
spheres of different sizes affects the variance of the points. Encoding different nuclear norms is 
intended to reveal which measurements are sensitive to the data variance. 

From Figure 8b (effective dimensionality, defined in the main paper), we can see the effective 
dimensions of each manifold. Manifolds 1 and 2 have two effective dimensions. Manifolds 3-7 have 
three effective dimensions. Since the points in Manifolds 8-10 are generated randomly, they have maximum rank. 
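
The two measures discussed above can be sketched as follows. This is an illustration under assumptions, since the exact definitions live in the main paper: the data matrix is mean-centered before the SVD, and the effective dimensionality is taken as the smallest number of singular values whose sum reaches 90% of the total (the label of Figure 8b suggests a 90% threshold, but whether it applies to singular values or their squares is not stated here).

```python
import numpy as np


def singular_values(points: np.ndarray) -> np.ndarray:
    """Singular values of the mean-centered data matrix (rows are points)."""
    centered = points - points.mean(axis=0, keepdims=True)
    return np.linalg.svd(centered, compute_uv=False)


def nuclear_norm(points: np.ndarray) -> float:
    """Sum of singular values; grows with the spread of the data."""
    return float(singular_values(points).sum())


def effective_dimensions(points: np.ndarray, energy: float = 0.90) -> int:
    """Smallest k such that the top-k singular values hold `energy` of their sum.

    Assumption: the threshold is applied to the singular values themselves.
    """
    s = singular_values(points)
    cumulative = np.cumsum(s) / s.sum()
    return int(np.searchsorted(cumulative, energy) + 1)
```

On a perfect view-circle embedded isometrically in a high-dimensional space, this returns 2 effective dimensions, while random Gaussian points need many more.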

The kernel alignment measures, KTA (Figure 8c) and HSIC (Figure 8d), measure the correlation 
between the view-manifold and the view-circle. These two figures show significantly better alignment 
for the view-manifolds of sets 1-6 than for the random manifolds. Since the Hilbert-Schmidt 
Independence Criterion (HSIC) (Gretton et al., 2005a) does not add any information beyond KTA, 
we select the KTA measurement, because it exposes absolute alignment confidence for 
manifolds 1 and 2. 
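
A minimal sketch of kernel target alignment between a view-kernel and the circle-kernel is given below. It uses the standard uncentered definition, KTA(K1, K2) = <K1, K2>_F / (||K1||_F ||K2||_F), which may differ in details such as centering from Equation 2 of the main paper. The Gaussian kernel and its median-distance bandwidth are assumptions made for this example.

```python
import numpy as np


def rbf_kernel(points: np.ndarray) -> np.ndarray:
    """Gaussian kernel matrix of a point set (rows are points).

    Assumption: the squared bandwidth is the median off-diagonal squared distance.
    """
    sq_norms = np.sum(points ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negative round-off
    off_diagonal = sq_dists[~np.eye(len(points), dtype=bool)]
    sigma2 = np.median(off_diagonal)
    return np.exp(-sq_dists / (2.0 * sigma2))


def kta(k1: np.ndarray, k2: np.ndarray) -> float:
    """Kernel target alignment: Frobenius inner product over the product of norms."""
    return float(np.sum(k1 * k2) / (np.linalg.norm(k1) * np.linalg.norm(k2)))
```

Under these assumptions, an isometric embedding of the view-circle yields the same kernel as the circle itself, so its alignment is 1 up to round-off, which is consistent with the absolute alignment reported for manifolds 1 and 2.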

KPLS-Regression Error is shown in Figure 8e. Despite the vast variability in variance and 
dimensionality, this measure is consistent and gives a small value for all smooth manifolds. This measure 
can also detect the collapsing manifolds, since it gives a very large error value for them. 

As we mention in Section D, using both measurements, KPLS-Regression Error and KPLS-Norm 
Ratio, gives a more robust conclusion about the manifold. Fig 8f shows a clear trend, since it 
gives significantly high values for random manifolds. This is because the subspace of the random 
points covers the entire space. When the ratio $\|G_d\| / \|G_0\| = 1$, the first d components extracted 
from $G_0$ are far from being principal; if they were principal components, extracting them would change 
the energy of the matrix G significantly. On the other hand, the effective dimensionality of the smooth 
manifolds 1-6 is $D \le 3$, so the limit d exceeds D. That is why the ratio is small for these manifolds: 
we have extracted all of their principal components. That is also why the KPLS-Regression Error for 
these manifolds is very small. 

As mentioned in the main paper, the TPS-linearity measure (TPS-RCond(CF-Poly)) scores 
the stability of the polynomial mapping from the points on the view-circle to the points on the 
view-manifold. Fig 8g shows perfect scoring for Manifolds 1 and 2. Combining this figure with 
Fig 8h gives a complete impression of the mapping stability (polynomial and non-polynomial). 
However, the values of the TPS-nonlinearity measure (TPS-RCond(CF-nonPoly)) 
are of a very small order of magnitude, which decreases its robustness. 
Figure 8: Measurement analysis for the synthetic manifolds. Every figure shows a single 
measurement. The x-axis is labeled by the manifold category number. Panels: (a) Nuclear Norm; 
(b) Effective 90% SV's; (c) KTA. 

D More on KPLS and TPS related Measurements 


We define and show the results of more measurements, such as KPLS-Norm Ratio and 
TPS-nonPolynomial. We use here the same notations and definitions stated in Section 4 in the main 
paper. 


1) KPLS-Norm Ratio: Kernel Partial Least Squares (KPLS) (Rosipal & Trejo, 2002) is a supervised 
regression method. KPLS iteratively extracts a set of principal components of the input kernel that 
are most correlated with the output. While KPCA extracts the principal components (PCs) of the 
kernel of the input data to maximize the variance of the output space, KPLS extracts the PCs of 
the kernel of the input data that maximize the correlation with the output data. We use KPLS to 
map the affinity matrix of the transformed view-manifold (view-kernel) to the circle affinity matrix 
(circle-kernel). Following the convention of the main paper, we denote the circle-kernel by $K^\circ$ 
(the subscript n is removed to simplify the notation). We limit the number of extracted PCs to d, 
where $d \ll N$ and N is the dimensionality of the input kernel (in this work, we use d = 5). More 
specifically, KPLS maps the rows of the view-kernel to the rows of $K^\circ$, so that 

$$\hat{K}^\circ = G_0 U (T^\top G_0 U)^{-1} T^\top K^\circ \qquad (2)$$

where the extracted PCs are the columns of the matrix $T_{N \times d}$, $U_{N \times d}$ is an auxiliary 
matrix, and the Gram matrix $G_0$ is defined by 

$$G_0 = b b^\top \qquad (3)$$

where $b \in \mathbb{R}^N$, so that $b(i)$ is the Frobenius norm of the i-th row of the view-kernel. 
Based on the mapping in Eq 2, we extract two measurements: 

First, KPLS-Regression Error ($\delta$), which measures the geometric deformation of the generated 
output image of the view-kernel in the circle-kernel space ($\hat{K}^\circ$) with respect to the 
circle-kernel ($K^\circ$). One choice for measuring this is 

$$\delta(\hat{K}^\circ, K^\circ) = 1 - \mathrm{KTA}(\hat{K}^\circ, K^\circ)$$

where KTA stands for Kernel Target Alignment (stated in Equation 2 in the main paper). The 
regression error measures the reconstruction error of the circle-kernel from the view-kernel. 

Second, KPLS-Norm Ratio, which measures the residual energy after extracting the first d PCs, 

$$\|G_d\| / \|G_0\|$$

where $G_d$ is the residual of $G_0$ after d iterations. The intuition behind this measure is that the 
larger the ratio $\|G_d\| / \|G_0\|$, the more likely it is that the view-manifold has more than d PCs 
correlated with the circle-kernel. 
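
The two KPLS measurements can be illustrated with a NIPALS-style kernel PLS loop in the spirit of Rosipal & Trejo. This is a sketch under assumptions rather than the procedure used for the reported figures: the view-kernel itself is fed in as the input Gram matrix (instead of the Gram matrix $G_0$ defined in the text), no kernel centering is applied, the fit on the training data is computed as $T T^\top Y$, and the Frobenius norm is used for the ratio. The function names and the fixed iteration count are hypothetical.

```python
import numpy as np


def kpls_scores(k_in, y, d=5, n_iter=200, seed=0):
    """NIPALS-style kernel PLS.

    Returns the orthonormal score matrix T (N x d) and the deflated input kernel.
    """
    rng = np.random.default_rng(seed)
    n = k_in.shape[0]
    k_res, y_res = k_in.copy(), y.copy()
    scores = np.zeros((n, d))
    for j in range(d):
        u = rng.standard_normal((n, 1))  # random start vector
        for _ in range(n_iter):
            t = k_res @ u
            t /= np.linalg.norm(t)
            u = y_res @ (y_res.T @ t)
            u /= np.linalg.norm(u)
        scores[:, j] = t[:, 0]
        proj = np.eye(n) - t @ t.T
        k_res = proj @ k_res @ proj        # remove the extracted direction from the kernel
        y_res = y_res - t @ (t.T @ y_res)  # and from the output
    return scores, k_res


def kpls_measures(k_view, k_circle, d=5):
    """Return (regression error delta, norm ratio) for a view-kernel and circle-kernel."""
    scores, k_res = kpls_scores(k_view, k_circle, d)
    k_hat = scores @ (scores.T @ k_circle)  # training-set fit, T T^T Y
    alignment = np.sum(k_hat * k_circle) / (
        np.linalg.norm(k_hat) * np.linalg.norm(k_circle)
    )
    delta = 1.0 - alignment
    ratio = np.linalg.norm(k_res) / np.linalg.norm(k_view)
    return float(delta), float(ratio)
```

In this sketch a smooth manifold whose kernel matches the circle-kernel gives both a small error and a small ratio, while a kernel built from random points leaves much more residual energy after d = 5 components.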

While KPLS-Regression Error is self-explanatory (this measure is presented in the main paper), using 
the two KPLS measurements together gives a more precise view of the correlation between the 
view-manifold and the circle-manifold. From Fig 9b, KPLS-Norm Ratio supports the observation that 
we noted in the main paper, from Fig 9a, that the lower layers in Model0 are more correlated with 
the circle-manifold than the higher layers, except for Pool5, which encodes maximum correlation 
between the view-manifold and the circle-manifold. 


2) TPS-nonlinearity measure: In this measure we learn a regularized Thin Plate Spline (TPS) 
nonlinear mapping (Duchon, 1977) between the unit circle manifold and each manifold $\mathcal{M}^k$. 
The mapping function ($\gamma$) can be written as 

$$\gamma^k(x) = C^k \cdot \psi(x),$$

where $C^k$ is the $d \times (N + e + 1)$ mapping matrix, e = 2, and the vector 
$\psi(x) = [\phi(|x - z_1|) \cdots \phi(|x - z_M|),\ 1,\ x^\top]^\top$ represents a nonlinear kernel map 
from the conceptual representation to a kernel induced space. The thin plate spline is defined as 
$\phi(r) = r^2 \log(r)$, and $\{z_i\}_{i=1}^{M}$ are the set of center points. 
The solution for $C^k$ can be obtained by directly solving the linear system: 

$$\begin{pmatrix} A + \lambda I & P_x \\ P_t^\top & 0_{(e+1)\times(e+1)} \end{pmatrix} C^{k\top} = \begin{pmatrix} Y^k \\ 0_{(e+1)\times d} \end{pmatrix} \qquad (4)$$

A, $P_x$ and $P_t$ are defined for the k-th set of object images as: A is an $N_k \times M$ matrix with 
$A_{ij} = \phi(|x_i^k - z_j|)$, $i = 1, \ldots, N_k$, $j = 1, \ldots, M$; $P_x$ is an $N_k \times (e+1)$ 
matrix with i-th row $[1, x_i^{k\top}]$; $P_t$ is an $M \times (e+1)$ matrix with i-th row $[1, z_i^\top]$. 
$Y^k$ is an $N_k \times d$ matrix containing the set of images for manifold $\mathcal{M}^k$, i.e. 
$Y^k = [y_1^k, \cdots, y_{N_k}^k]$. A solution for $C^k$ is guaranteed under certain 
conditions on the basic functions used. 


The reason for using TPS in particular is that the mapping has two parts, an affine part (linear poly¬ 
nomial) and a nonlinear part. Inquiring into the two parts gives an impression about the mapping, if it 
is mostly linear or nonlinear. We used the reciprocal-condition number (RCond) of the submatrices 
of the coefficient matrix that correspond to the affine and the nonlinear part. 
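
As a rough illustration of the reciprocal condition number used here, the sketch below computes it from singular values, which also works for rectangular coefficient blocks. The choice of the 2-norm is an assumption, since the text does not say which norm underlies RCond, and the function name `rcond2` is hypothetical.

```python
import numpy as np


def rcond2(matrix: np.ndarray) -> float:
    """Reciprocal 2-norm condition number: smallest over largest singular value.

    Returns a value in [0, 1]; 1 means perfectly conditioned, 0 means
    rank-deficient. Works for square and rectangular matrices.
    """
    s = np.linalg.svd(matrix, compute_uv=False)
    if s[0] == 0.0:
        return 0.0  # the zero matrix has no well-defined conditioning
    return float(s[-1] / s[0])
```

A coefficient block with orthonormal columns scores 1, while a block that nearly loses rank scores close to 0, which matches the reading that larger RCond values indicate a more stable mapping.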

Fig 9c shows that the lower layers have a better-conditioned linear mapping, and Fig 9d 
shows that the lower layers have a completely stable mapping. This is expected since the lower layers 
have high dimensionality. At the same time, Fig 9d shows that the convolution layers (Conv 3, 4 and 
5) have unstable nonlinear mappings. An additional observation is that fine-tuning against the pose 
labels increases the mapping stability (polynomial and non-polynomial). It is clear in Fig 9d that 
TPS-RCond(CF-nonPoly) takes values of a very small order of magnitude; therefore, we do not rely 
on it in our analysis. 
Figure 9: Measurement analysis for the view-manifold in the RGBD dataset based on features extracted 
from different layers of several CNN models. Every figure shows a single measurement. Multiple 
lines are for different CNN models. The x-axis is labeled by the layers. Panels: (a) KPLS-Regression 
Error ($\delta$); (b) KPLS-Norm Ratio, norm($G_d$)/norm($G_0$); (c) TPS-RCond(CF-poly); 
(d) TPS-RCond(CF-nonPoly). 